Abstract

We introduce in this paper a new algorithm for Multi-Armed Bandit (MAB) problems, a machine
learning paradigm popular within Cognitive Network related topics (e.g., Spectrum Sensing and
Allocation). We focus on the case where the rewards are exponentially distributed, which is common when
dealing with Rayleigh fading channels. The proposed strategy, named Multiplicative Upper Confidence Bound
(MUCB), associates a utility index with every available arm and then selects the arm with the highest
index. For every arm, the associated index is equal to the product of a multiplicative factor and the
sample mean of the rewards collected from this arm. We show that the MUCB policy has a low complexity
and is order optimal.

Index Terms 

Learning, Multi-armed bandit, Upper Confidence Bound Algorithm, UCB, MUCB, exponential distribution.

I. Introduction 

Several sequential decision making problems face a dilemma between the exploration of a space of
choices, or solutions, and the exploitation of the information available to the decision maker. This
problem is known as sequential decision making under uncertainty. In this paper we focus on a
sub-class of this problem, where the decision maker has a discrete set of stateless choices and the added
information is a real-valued sequence (of feedbacks, or rewards) that quantifies how well the decision
maker behaved in the previous time steps. This particular instance of sequential decision making problems
is generally known as the multi-armed bandit (MAB) problem [1], [2].

A common approach to solving the exploration versus exploitation dilemma within MAB problems
consists in assigning a utility value to every arm. An arm's utility aggregates all the past information
about the lever and quantifies the gambler's interest in pulling it. Such utilities are called indexes. Agrawal
et al. [2] emphasized the family of indexes minimizing the expected cumulated loss and called them
Upper Confidence Bound (UCB) indexes. UCB indexes provide an optimistic estimation of the arms'
performances while ensuring a rapidly decreasing probability of selecting a suboptimal arm. The decision
maker builds its policy by greedily selecting the largest index. Recently, Auer et al. [3] proved that a simple
additive form, combining the rewards' sample mean and a bias and known as UCB1, can achieve order
optimality over time when dealing with rewards drawn from bounded distributions. Tackling exponentially
distributed rewards remains, however, a challenge, as the optimal learning algorithms for this case prove
to be complex to implement [1], [2].

This paper is inspired by the aforementioned work. However, we suggest the analysis of a
multiplicative rather than an additive expression for the index.

The main contribution of this paper is to design and analyze a simple, deterministic, multiplicative 
index-based policy. The decision making strategy computes an index associated to every available arm, 
and then selects the arm with the highest index. Every index associated to an arm is equal to the product 
of the sample mean of the reward collected by this arm and a scaling factor. The scaling factor is chosen 
so as to provide an optimistic estimation of the considered arm's performance. 

We show that our decision policy has a low computational complexity and can lead to a logarithmic
loss over time under some non-restrictive conditions. For the rest of this paper we will refer to our
suggested policy as the Multiplicative Upper Confidence Bound index (MUCB).

The outline of this paper is the following: We start by presenting some general notions on the
multi-armed bandit framework with exponentially distributed rewards in Section II. Then, Section III
introduces our index policy and Section IV analyzes its behavior, proving the order optimality of the
suggested algorithm. Finally, Section V concludes.

II. Multi-Armed Bandits 

A $K$-armed bandit ($K \in \mathbb{N}$) is a machine learning problem based on an analogy with the traditional
slot machine (one-armed bandit) but with more than one lever. Such a problem is defined by the $K$-tuple
$(\theta_1, \theta_2, \dots, \theta_K) \in \Theta^K$, $\Theta$ being the set of all positive reward distributions. When pulled
at a time $t \in \mathbb{N}$, each lever¹ $k \in [1, K]$ (where $[1, K] = \{1, \dots, K\}$) provides a reward $r_t$ drawn from
a distribution $\theta_k$ associated to that specific lever. The objective of the gambler is to maximize the
cumulated sum of rewards through iterative pulls. It is generally assumed that the gambler has no (or
partial) initial knowledge about the levers. The crucial tradeoff the gambler faces at each trial is between
exploitation of the lever that has the highest expected payoff and exploration to get more information
about the expected payoffs of the other levers. In this paper, we assume that the different exponentially
distributed payoffs drawn from a machine are independent and identically distributed (i.i.d.) and that
the independence of the rewards holds between the machines. However, the different machines' reward
distributions $(\theta_1, \theta_2, \dots, \theta_K)$ are not supposed to be the same.

¹ We use the words "lever", "arm", and "machine" interchangeably.

Let $I_t \in [1, K]$ denote the machine selected at a time $t$, and let $H_t$ be the history vector available to
the gambler at instant $t$, i.e., $H_t = [I_0, r_0, I_1, r_1, \dots, I_{t-1}, r_{t-1}]$.

We assume that the gambler uses a policy $\pi$ to select arm $I_t$ at instant $t$, such that $I_t = \pi(H_t)$.
We shall also write $\forall k \in [1, K]$, $\mu_k = 1/\lambda_k = E[\theta_k]$, where $\lambda_k$ refers to the parameter of the considered
exponential distribution with pdf $f_{\theta_k}(x) = \lambda_k e^{-\lambda_k x}$, $x \ge 0$, and we assume that $\mu_k > 0$, $\forall k \in [1, K]$.
The (cumulated) regret of a policy $\pi$ at time $t$ (after $t$ pulls) is defined as follows:
$R_t = t\mu^* - \sum_{m=0}^{t-1} r_m$, where $\mu^* = \max_{k \in [1, K]} \{\mu_k\}$ refers to the expected reward of the optimal arm.

We seek to find a policy that minimizes the expected cumulated regret (Equation (1)),

$$E[R_t] = \sum_{k \neq k^*} \Delta_k E[T_{k,t}], \tag{1}$$

where $\Delta_k = \mu^* - \mu_k$ is the expected loss of playing arm $k$, and $T_{k,t}$ refers to the number of times the
machine $k$ has been played from instant $0$ to instant $t - 1$.
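
As an illustration of Equation (1), the following Python sketch evaluates the regret in both of its forms. It is only a minimal example: the function names `expected_regret` and `realized_regret` are hypothetical and do not come from the paper, and the pull counts passed to the first function stand in for the expectations $E[T_{k,t}]$.

```python
def expected_regret(means, expected_pulls):
    """Sum over arms of Delta_k * E[T_k], with Delta_k = mu_star - mu_k."""
    mu_star = max(means)
    # The optimal arm contributes nothing because its Delta_k is 0.
    return sum((mu_star - mu) * n for mu, n in zip(means, expected_pulls))


def realized_regret(means, rewards):
    """R_t = t * mu_star - (sum of the t collected rewards)."""
    return len(rewards) * max(means) - sum(rewards)


if __name__ == "__main__":
    # Hypothetical example: three arms, 100 pulls in total.
    print(expected_regret([1.0, 0.5, 0.25], [80, 15, 5]))
```

With the hypothetical values above, the two suboptimal arms contribute 0.5 x 15 and 0.75 x 5, so the expected regret is 11.25.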

III. Multiplicative upper confidence bound algorithms 

This section presents our main contribution, the introduction of a new multiplicative index. Let $B_{k,t}(T_{k,t})$
denote the index of arm $k$ at time $t$ after being pulled $T_{k,t}$ times. We refer to as Multiplicative Upper
Confidence Bound algorithms (MUCB) the family of indexes that can be written in the form:

$$B_{k,t}(T_{k,t}) = \bar{X}_{k,t}(T_{k,t}) \, M_{k,t}(T_{k,t}),$$

where $\bar{X}_{k,t}(T_{k,t})$ is the sample mean of machine $k$ at step $t$ after $T_{k,t}$ pulls, i.e.,
$\bar{X}_{k,t}(T_{k,t}) = \frac{1}{T_{k,t}} \sum_{i=0}^{t-1} \mathbb{1}_{\{I_i = k\}} r_i$,
and $M_{k,t}(\cdot)$ is an upper confidence scaling factor chosen to ensure that the index $B_{k,t}(T_{k,t})$ is an increasing
function of the number of rounds $t$. This last property ensures that the index of an arm that has not been
pulled for a long time will increase, thus eventually leading to the sampling of this arm. We introduce a
particular parametric class of MUCB indexes, which we call MUCB($\alpha$), given as follows:

$$\forall \alpha > 0, \quad M_{k,t}(T_{k,t}) = \frac{1}{\max\left\{0;\, 1 - \sqrt{\frac{\alpha \ln(t)}{T_{k,t}}}\right\}}. \tag{2}$$

We adopt the convention that $1/0 = +\infty$. Given a history $H_t$, one can compute the values of $T_{k,t}$ and
$M_{k,t}$ and derive an index-based policy $\pi$ as follows:

$$I_t = \pi(H_t) \in \arg\max_{k \in [1, K]} \left\{ B_{k,t}(T_{k,t}) \right\}. \tag{3}$$
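
To make the policy concrete, here is a minimal Python sketch of an MUCB($\alpha$) index and of the greedy selection rule. It assumes that the scaling factor is $1 / \max\{0;\, 1 - \sqrt{\alpha \ln(t) / T_{k,t}}\}$ with the convention $1/0 = +\infty$. Several choices are assumptions of this sketch and not statements from the paper: time starts at $t = 1$, an arm that has never been played receives an infinite index, ties are broken in favor of the lowest arm number, and all function names (`mucb_factor`, `mucb_index`, `select_arm`, `run_mucb`) are hypothetical.

```python
import math
import random


def mucb_factor(pulls, t, alpha):
    """Scaling factor 1 / max(0, 1 - sqrt(alpha * ln(t) / pulls)), with 1/0 = +inf."""
    if pulls == 0:
        return math.inf  # assumption: an unplayed arm is forced to be explored
    shrink = 1.0 - math.sqrt(alpha * math.log(t) / pulls)
    return math.inf if shrink <= 0.0 else 1.0 / shrink


def mucb_index(reward_sum, pulls, t, alpha):
    """Index = sample mean * scaling factor (infinite while under-explored)."""
    factor = mucb_factor(pulls, t, alpha)
    if math.isinf(factor):
        return math.inf  # avoids computing 0 * inf when the sample mean is 0
    return (reward_sum / pulls) * factor


def select_arm(reward_sums, pulls, t, alpha):
    """Greedy choice of the largest index; ties go to the lowest arm number."""
    indexes = [mucb_index(s, n, t, alpha) for s, n in zip(reward_sums, pulls)]
    return max(range(len(indexes)), key=indexes.__getitem__)


def run_mucb(means, horizon, alpha=4.1, seed=0):
    """Simulate the policy on exponential arms and return the pull counts."""
    rng = random.Random(seed)
    reward_sums = [0.0] * len(means)
    pulls = [0] * len(means)
    for t in range(1, horizon + 1):  # t starts at 1 so that ln(t) is defined
        arm = select_arm(reward_sums, pulls, t, alpha)
        reward_sums[arm] += rng.expovariate(1.0 / means[arm])  # rate = 1 / mean
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(run_mucb([1.0, 0.5, 0.25], horizon=5000))
```

Each decision only needs the running reward sum and pull count of every arm, so the cost per round is linear in the number of arms, which is in line with the low complexity claimed for the policy.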

IV. Analysis of MUCB(a) policies 

This section analyzes the theoretical properties of MUCB($\alpha$) algorithms. More specifically, it focuses
on determining how fast the optimal arm is identified and how likely anomalies, that is, suboptimal
pulls, are.

A. Consistency and order optimality of MUCB indexes 

Definition 1 ($\beta$-consistency): Consider the set $\Theta^K$ of $K$-armed bandit problems. A policy $\pi$ is said to
be $\beta$-consistent, $0 < \beta \le 1$, with respect to $\Theta^K$, if and only if

$$\forall (\theta_1, \dots, \theta_K) \in \Theta^K, \quad \lim_{t \to \infty} \frac{E[R_t]}{t^{\beta}} = 0. \tag{4}$$

We expect good policies to be at least 1-consistent. As a matter of fact, 1-consistency ensures that,
asymptotically, the average expected reward is optimal.

From the expression of Equation (1), one can remark that it is sufficient to upper bound the expected
number of times $E[T_{k,t}]$ one plays a suboptimal machine $k$ after $t$ rounds to obtain an upper bound on
the expected cumulated regret. This leads to the following theorem.

Theorem 1 (Order optimality of MUCB($\alpha$) policies): Let $\rho_k = \mu_k / \mu^*$, $k \in [1, K] \setminus \{k^*\}$. For all
$K \ge 2$, if policy MUCB($\alpha > 4$) is run on $K$ machines having rewards drawn from exponential
distributions $\theta_1, \dots, \theta_K$, then:

$$E[R_t] \le \sum_{k \neq k^*} \frac{4 \mu^* \alpha}{1 - \rho_k} \ln(t) + o(\ln(t)). \tag{5}$$

² The form of Equation (2) offers a compact mathematical formula. Practically speaking, however, a machine $k$
is played when $T_{k,t} < \alpha \ln(t)$. Otherwise, the machine with the largest finite index is played.
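
To get a feel for the constant in front of $\ln(t)$, the sketch below evaluates the leading term of the bound for a given set of arm means. It assumes that the bound of Theorem 1 reads $\sum_{k \neq k^*} \frac{4 \mu^* \alpha}{1 - \rho_k} \ln(t)$ up to the $o(\ln(t))$ term, which is ignored here. The function name `regret_bound_leading_term` is hypothetical.

```python
import math


def regret_bound_leading_term(means, alpha, t):
    """Leading term: sum over suboptimal arms of 4*alpha*mu_star/(1-rho_k), times ln(t)."""
    if alpha <= 4:
        raise ValueError("Theorem 1 is stated for alpha > 4")
    mu_star = max(means)
    constant = 0.0
    for mu in means:
        if mu < mu_star:  # only suboptimal arms contribute
            rho = mu / mu_star
            constant += 4.0 * alpha * mu_star / (1.0 - rho)
    return constant * math.log(t)


if __name__ == "__main__":
    # Hypothetical two-armed problem with means 1.0 and 0.5.
    print(regret_bound_leading_term([1.0, 0.5], alpha=5.0, t=10_000))
```

The constant grows as $\rho_k$ approaches 1, which reflects the fact that arms whose means are close to the optimal one are harder to rule out.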
Proving Theorem 1 relies on three lemmas that we analyze and prove in the next subsection.
Lemma 1 provides a general bound for the regret regardless of the policy considered. The expression
is a function of two probabilities related to learning anomalies. These anomalies, which depend on the
learning algorithm, are introduced and analyzed below. Then, through Lemmas 2 and 3, we upper bound them.



B. Learning Anomalies and Consistency of MUCB policies 

Let us introduce the set $\mathbb{S} = \mathbb{N} \times \mathbb{R}$; then, one can write $S_{k,t} = (T_{k,t}, B_{k,t}) \in \mathbb{S}$ the decision state of
arm $k$ at time $t$. We associate the product order to the set $\mathbb{S}$: for a pair of states $S = (T, B) \in \mathbb{S}$ and
$S' = (T', B') \in \mathbb{S}$, we write $S \ge S'$ if and only if $T \ge T'$ and $B \ge B'$.

Definition 2 (Anomaly of type 1): We assume that there exists at least one suboptimal machine, i.e.,
$[1, K] \setminus \{k^*\} \neq \emptyset$. We call anomaly of type 1, denoted by $\{\phi_1(u_k)\}_{k,t}$, for a suboptimal machine
$k \in [1, K] \setminus \{k^*\}$, and with parameter $u_k \in \mathbb{N}$, the following event:

$$\{\phi_1(u_k)\}_{k,t} = \left\{ S_{k,t} \ge (u_k, \mu^*) \right\}.$$

Definition 3 (Anomaly of type 2): We refer to as anomaly of type 2, denoted by $\{\phi_2\}_t$, associated to
the optimal machine $k^*$, the following event:

$$\{\phi_2\}_t = \left\{ S_{k^*,t} \le (\infty, \mu^*) \cap T_{k^*,t} \ge 1 \right\}.$$

Lemma 1 (Expected cumulated regret. Proof in VI-B): Given a policy $\pi$ and a MAB problem, let
$u = [u_1, \dots, u_K]$ represent a set of integers; then the expected cumulated regret is upper bounded by:

$$E[R_t] \le \sum_{k \neq k^*} \Delta_k u_k + \sum_{k \neq k^*} \Delta_k P_t(u_k)$$

with

$$P_t(u_k) = \sum_{m = u_k + 1}^{t-1} \left[ P\left(\{\phi_1(u_k)\}_{k,m}\right) + P\left(\{\phi_2\}_m\right) \right].$$

We consider the following values for the set $u$: for all suboptimal arms $k$,
$u_k(t) = \left\lceil \frac{4\alpha}{(1 - \rho_k)^2} \ln(t) \right\rceil$.
We show in the two following lemmas that, for the defined set $u$, the probabilities of the anomalies are
upper bounded by exponentially decreasing functions of the number of iterations.

Lemma 2 (Upper bound of Anomaly 1. Proof in VI-C): For all $K \ge 2$, if policy MUCB($\alpha$) is run
on $K$ machines having rewards drawn from exponential distributions $\theta_1, \dots, \theta_K$, then
$\forall k \in [1, K] \setminus \{k^*\}$:

$$P\left(\{\phi_1(u_k)\}_{k,t}\right) \le t^{-\alpha/2 + 1}. \tag{6}$$

Lemma 3 (Upper bound of Anomaly 2. Proof in VI-D): For all $K \ge 2$, if policy MUCB($\alpha$) is run
on $K$ machines having rewards drawn from exponential distributions $\theta_1, \dots, \theta_K$, then:

$$P\left(\{\phi_2\}_t\right) \le t^{-\alpha/2 + 1}. \tag{7}$$



We end this paper by the proof of Theorem 1.

Proof of Theorem 1: For $\alpha > 4$, relying on Lemmas 1, 2 and 3, we can write:

$$E[R_t] \le \sum_{k \neq k^*} \Delta_k \left\lceil \frac{4\alpha}{(1 - \rho_k)^2} \ln(t) \right\rceil + o(\ln(t)),$$

with $\sum_{k \neq k^*} \Delta_k P_t(u_k) = o(\ln(t))$. Finally, since $\Delta_k = \mu^*(1 - \rho_k)$ and
$u_k(t) = \frac{4\alpha}{(1 - \rho_k)^2} \ln(t) + o(\ln(t))$, we find the stated result in Theorem 1. ∎

V. Conclusion 

A new low-complexity algorithm for MAB problems is suggested and analyzed in this paper: MUCB.
The analysis of its regret proves that the algorithm is order optimal over time. In order to quantify its
performance compared to optimal algorithms, further empirical evaluations are needed and are currently
under investigation.
VI. Appendix

A. Large deviations inequalities 

Assumption 1 (Cramér condition): Let $X$ be a real random variable. $X$ satisfies the Cramér condition
if and only if

$$\exists \gamma > 0 : \forall \eta \in (0, \gamma), \; E\left[e^{\eta X}\right] < \infty.$$

Lemma 4 (Cramér-Chernoff Lemma for the sample mean): Let $X_1, \dots, X_n$ ($n \in \mathbb{N}$) be a sequence of
i.i.d. real random variables satisfying the Cramér condition with expected value $E[X]$. We denote by $\bar{X}_n$
the sample mean $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Then, there exist two functions $l_1(\cdot)$ and $l_2(\cdot)$ such that:

$$\forall \beta_1 > E[X], \quad P\left(\bar{X}_n \ge \beta_1\right) \le e^{-l_1(\beta_1) n},$$

$$\forall \beta_2 < E[X], \quad P\left(\bar{X}_n \le \beta_2\right) \le e^{-l_2(\beta_2) n}.$$

Functions $l_1(\cdot)$ and $l_2(\cdot)$ do not depend on the sample size $n$ and are continuous, non-negative, strictly
increasing (respectively strictly decreasing) for all $\beta_1 > E[X]$ (respectively $\beta_2 < E[X]$), both null for
$\beta_1 = \beta_2 = E[X]$.

This result was initially proposed and proved in [4]. The bounds provided by this lemma are called
Large Deviations Inequalities (LDIs) in this paper.

In the case of exponential distributions this theorem can be applied and the LDI functions have the
following expressions:

$$l_1(\beta) = l_2(\beta) = \frac{\beta}{E[X]} - 1 - \ln\left(\frac{\beta}{E[X]}\right) \ge \frac{3\left(1 - \frac{\beta}{E[X]}\right)^2}{2\left(1 + 2\frac{\beta}{E[X]}\right)}.$$
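
The following sketch evaluates the LDI function for exponential rewards and compares it with a quadratic-type lower bound. It assumes the classical Cramér rate function of an exponential variable, $l(\beta) = \beta / E[X] - 1 - \ln(\beta / E[X])$, and the lower bound $3(1 - x)^2 / (2(1 + 2x))$ with $x = \beta / E[X]$. The function names are hypothetical and the grid of test points is arbitrary.

```python
import math


def ldi_rate(beta, mean):
    """Rate function x - 1 - ln(x) with x = beta / mean (exponential rewards)."""
    x = beta / mean
    return x - 1.0 - math.log(x)


def ldi_rate_lower_bound(beta, mean):
    """Lower bound 3 * (1 - x)^2 / (2 * (1 + 2x)) with x = beta / mean."""
    x = beta / mean
    return 3.0 * (1.0 - x) ** 2 / (2.0 * (1.0 + 2.0 * x))


def chernoff_bound(beta, mean, n):
    """Upper bound exp(-l(beta) * n) on the deviation probability of the sample mean."""
    return math.exp(-ldi_rate(beta, mean) * n)


if __name__ == "__main__":
    for x in (0.1, 0.5, 0.9, 1.0, 1.1, 2.0, 10.0):
        print(x, ldi_rate(x, 1.0), ldi_rate_lower_bound(x, 1.0))
```

The difference between the two functions has derivative $(x - 1)^3 / (x (1 + 2x)^2)$ and vanishes at $x = 1$, so the rate function dominates the lower bound for every $x > 0$.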
B. Proof of Lemma 1

According to Equation (1), $E[R_t] = \sum_{k \neq k^*} \Delta_k E[T_{k,t}]$. Per definition, $T_{k,t} = \sum_{m=0}^{t-1} \mathbb{1}_{\{I_m = k\}}$.
Then, $E[T_{k,t}] = \sum_{m=0}^{t-1} E\left[\mathbb{1}_{\{I_m = k\}}\right]$. After playing an arm $u_k$ times, bounding the first $u_k$ terms
by 1 yields:

$$E[T_{k,t}] \le u_k + \sum_{m = u_k + 1}^{t-1} P\left(\{I_m = k\} \cap \{T_{k,m} \ge u_k\}\right). \tag{8}$$

Then we can notice that the following events are equivalent:

$$\{I_m = k\} = \left\{ B_{k,m} \ge \max_{k' \neq k} B_{k',m} \right\}.$$

Moreover we can notice that:

$$\left\{ B_{k,m} \ge \max_{k' \neq k} B_{k',m} \right\} \subseteq \left\{ B_{k,m} \ge B_{k^*,m} \right\},$$

which can be further included in the following union of events:

$$\left\{ B_{k,m} \ge B_{k^*,m} \right\} \subseteq \left\{ B_{k,m} \ge \mu^* \right\} \cup \left\{ \mu^* > B_{k^*,m} \right\}.$$

Consequently we can write:

$$\{I_m = k\} \cap \{T_{k,m} \ge u_k\} \subseteq \{\phi_1(u_k)\}_{k,m} \cup \{\phi_2\}_m. \tag{9}$$

Finally, we apply the probability operator:

$$E[T_{k,t}] \le u_k + \sum_{m = u_k + 1}^{t-1} \left[ P\left(\{\phi_1(u_k)\}_{k,m}\right) + P\left(\{\phi_2\}_m\right) \right]. \tag{10}$$

The combination of Equation (1), given at the beginning of this proof, and Equation (10) concludes this
proof.

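C. Proof of Lemma 2

From the definition of $\{\phi_1(u_k)\}_{k,t}$ we can write that:

$$P\left(\{\phi_1(u_k)\}_{k,t}\right) = P\left( S_{k,t} \ge (u_k, \mu^*) \right) \le \sum_{u = u_k}^{t-1} P\left( B_{k,t}(u) \ge \mu^* \right).$$

In the case of MUCB policies, we have:

$$\forall u \le t, \quad P\left( B_{k,t}(u) \ge \mu^* \right) = P\left( \bar{X}_{k,t}(u) \ge \frac{\mu^*}{M_{k,t}(u)} \right).$$

Consequently, we can upper bound the probability of occurrence of type 1 anomalies by:

$$P\left(\{\phi_1(u_k)\}_{k,t}\right) \le \sum_{u = u_k}^{t-1} P\left( \bar{X}_{k,t}(u) \ge \frac{\mu^*}{M_{k,t}(u)} \right).$$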



Let us define $\beta_{k,t}(T_{k,t}) = \mu^* / M_{k,t}(T_{k,t})$.

Since we are dealing with exponential distributions, the rewards provided by the arm $k$ satisfy the
Cramér condition. As a matter of fact, since $u \ge u_k \ge \frac{4\alpha \ln(t)}{(1 - \rho_k)^2}$, then:

$$\beta_{k,t}(u) \lambda_k = \rho_k^{-1} \left( 1 - \sqrt{\frac{\alpha \ln(t)}{u}} \right) \ge \frac{1 + \rho_k}{2 \rho_k} > 1.$$

So, according to the large deviation inequality for $\bar{X}_{k,t}(T_{k,t})$ given by Lemma 4 (with $T_{k,t} \ge u_k$ and $u_k$
large enough), there exists a continuous, non-decreasing, non-negative function $l_{1,k}$ such that:

$$P\left( \bar{X}_{k,t}(T_{k,t}) \ge \beta_{k,t}(T_{k,t}) \mid T_{k,t} = u \right) \le e^{-l_{1,k}(\beta_{k,t}(u)) u}.$$

Finally:

$$P\left(\{\phi_1(u_k)\}_{k,t}\right) \le \sum_{u = u_k}^{t-1} e^{-l_{1,k}(\beta_{k,t}(u)) u}. \tag{11}$$



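The end of this proof aims at proving that for $u \ge u_k$: $l_{1,k}(\beta_{k,t}(u)) \ge \frac{\alpha \ln(t)}{2u}$.

Note that since we are dealing with exponential distributions we can write:
$l_{1,k}(\beta_{k,t}(u)) \ge \frac{3\left(1 - \beta_{k,t}(u) \lambda_k\right)^2}{2\left(1 + 2 \beta_{k,t}(u) \lambda_k\right)}$.
Moreover, since $u \ge u_k \ge \frac{4\alpha \ln(t)}{(1 - \rho_k)^2}$, then:

$$\beta_{k,t}(u) \lambda_k = \rho_k^{-1} \left( 1 - \sqrt{\frac{\alpha \ln(t)}{u}} \right) \le \rho_k^{-1}.$$

Consequently it is sufficient to prove that:

$$\frac{3\left(1 - \beta_{k,t}(u) \lambda_k\right)^2}{2\left(1 + 2\rho_k^{-1}\right)} \ge \frac{\alpha \ln(t)}{2u}.$$

Let us define $h(t)$ as a function of time: $h(t) = \sqrt{\frac{\alpha \ln(t)}{u}} \in [0, 1]$. We analyze the sign of the function:

$$g(h(t)) = \left( \rho_k^{-1} (1 - h(t)) - 1 \right)^2 - \frac{1 + 2\rho_k^{-1}}{3} h(t)^2. \tag{12}$$

Consequently we need to prove that for $u \ge u_k$, $g(\cdot)$ has positive values.
Factorizing the last equation leads to the following two terms:

$$g(h(t)) = \left[ \left( \rho_k^{-1} - \sqrt{\frac{1 + 2\rho_k^{-1}}{3}} \right) h(t) - \left( \rho_k^{-1} - 1 \right) \right] \left[ \left( \rho_k^{-1} + \sqrt{\frac{1 + 2\rho_k^{-1}}{3}} \right) h(t) - \left( \rho_k^{-1} - 1 \right) \right]. \tag{13}$$

Since per definition $h(t) \in [0, 1]$ and $\rho_k^{-1} > 1$, then
$\left( \rho_k^{-1} - \sqrt{\frac{1 + 2\rho_k^{-1}}{3}} \right) h(t) - \left( \rho_k^{-1} - 1 \right) < 0$.

Consequently, $g(\cdot)$ is positive only if the second term of Equation (13) is negative, i.e.,
$\sqrt{\frac{\alpha \ln(t)}{u}} \le \frac{\rho_k^{-1} - 1}{\rho_k^{-1} + \sqrt{(1 + 2\rho_k^{-1})/3}}$.
Since $u \ge u_k$, the last inequality is verified. Finally, upper bounding Equation (11) for $u \ge u_k$:

$$P\left(\{\phi_1(u_k)\}_{k,t}\right) \le \sum_{u = u_k}^{t-1} \frac{1}{t^{\alpha/2}} \le \frac{1}{t^{\alpha/2 - 1}}.$$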
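
The key inequality of this proof, $l_{1,k}(\beta_{k,t}(u)) \ge \alpha \ln(t) / (2u)$ for $u \ge u_k$, can be checked numerically with the short sketch below. It rests on assumptions that are not spelled out verbatim in the text: the threshold is taken as $u_k = \lceil 4\alpha \ln(t) / (1 - \rho_k)^2 \rceil$, the rate function is the exponential one, $x - 1 - \ln(x)$, and it is evaluated at $x = \beta_{k,t}(u) \lambda_k = \rho_k^{-1} (1 - \sqrt{\alpha \ln(t) / u})$. The names `exploration_threshold` and `rate_margin` are hypothetical.

```python
import math


def exploration_threshold(rho, alpha, t):
    """Assumed threshold u_k = ceil(4 * alpha * ln(t) / (1 - rho)^2), for rho < 1."""
    return math.ceil(4.0 * alpha * math.log(t) / (1.0 - rho) ** 2)


def rate_margin(rho, alpha, t, u):
    """Rate function at beta * lambda minus the target alpha * ln(t) / (2u)."""
    h = math.sqrt(alpha * math.log(t) / u)
    x = (1.0 - h) / rho  # beta_{k,t}(u) * lambda_k
    rate = x - 1.0 - math.log(x)
    return rate - alpha * math.log(t) / (2.0 * u)


if __name__ == "__main__":
    for rho in (0.1, 0.5, 0.9):
        u_k = exploration_threshold(rho, 4.5, 1000)
        # A non-negative margin means the inequality holds at u = u_k.
        print(rho, u_k, rate_margin(rho, 4.5, 1000, u_k))
```

A non-negative margin at and beyond the threshold is what allows each term of the sum in Equation (11) to be bounded by $t^{-\alpha/2}$.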
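D. Proof of Lemma 3

This proof follows the same steps as the proof in Subsection VI-C.

From the definition of $\{\phi_2\}_t$ we can write that: $P\left(\{\phi_2\}_t\right) \le \sum_{u=1}^{t-1} P\left( B_{k^*,t}(u) \le \mu^* \right)$.
In the case of MUCB policies, we have:

$$\forall u \le t, \quad P\left( B_{k^*,t}(u) \le \mu^* \right) = P\left( \bar{X}_{k^*,t}(u) \le \frac{\mu^*}{M_{k^*,t}(u)} \right).$$

Consequently, we can upper bound the probability of occurrence of type 2 anomalies by:

$$P\left(\{\phi_2\}_t\right) \le \sum_{u=1}^{t-1} P\left( \bar{X}_{k^*,t}(u) \le \mu^* \max\left\{0;\, 1 - \sqrt{\frac{\alpha \ln(t)}{u}} \right\} \right).$$

Since $\mu^* \max\left\{0;\, 1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right\} \le \mu^*$, Cramér's condition is verified. Moreover, since the machine is
played when the maximum in the previous term is equal to 0, we can consider that $u \ge \alpha \ln(t)$ and that:

$$\mu^* \max\left\{0;\, 1 - \sqrt{\frac{\alpha \ln(t)}{u}} \right\} = \mu^* \left( 1 - \sqrt{\frac{\alpha \ln(t)}{u}} \right).$$

Consequently, we can upper bound the occurrence of Anomaly 2:

$$P\left(\{\phi_2\}_t\right) \le \sum_{u = \alpha \ln(t)}^{t-1} e^{-l_2(\beta_{k^*,t}(u)) u}, \tag{14}$$

where $\beta_{k^*,t}(u) = \mu^* / M_{k^*,t}(u)$ and $l_2(\beta_{k^*,t}(u))$ verifies the LDI as defined in Appendix VI-A. Thus, after
mild simplifications we can write:

$$l_2(\beta_{k^*,t}(u)) \ge \frac{3 \alpha \ln(t) / u}{2\left(1 + 2\left(1 - \sqrt{\frac{\alpha \ln(t)}{u}}\right)\right)} \ge \frac{\alpha \ln(t)}{2u}.$$

Consequently, including this last inequality into Equation (14) ends the proof.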